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Many model search strategies involve trading off model fit with 
model complexity in a penalized goodness of fit measure. Asymp- 
totic properties for these types of procedures in settings like linear 
regression and ARMA time series have been studied, but these do 
not naturally extend to nonstandard situations such as mixed effects 
models, where simple definition of the sample size is not meaning- 
ful. This paper introduces a new class of strategies, known as fence 
methods, for mixed model selection, which includes linear and gener- 
alized linear mixed models. The idea involves a procedure to isolate a 
subgroup of what are known as correct models (of which the optimal 
model is a member) . This is accomplished by constructing a statisti- 
cal fence, or barrier, to carefully eliminate incorrect models. Once the 
fence is constructed, the optimal model is selected from among those 
within the fence according to a criterion which can be made flexible. 
In addition, we propose two variations of the fence. The first is a 
stepwise procedure to handle situations of many predictors; the sec- 
ond is an adaptive approach for choosing a tuning constant. We give 
sufficient conditions for consistency of fence and its variations, a de- 
sirable property for a good model selection procedure. The methods 
are illustrated through simulation studies and real data analysis. 

1. Introduction. On the morning of March 16, 1971, Hirotugu Akaike, 
as he was taking a seat on a commuter train, came out with the idea of a 
connection between the relative Kuhback-Leibler discrepancy and the em- 
pirical log-likelihood function, a procedure that was later named Akaike's 
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information criterion, or AIC (Akaike [1, 2]; see Bozdogan [5] for the his- 
torical note). The idea has allowed major advances in model selection and 
related fields (e.g., de Leeuw [7]). 

The procedure essentially amounts to minimizing a criterion function of 
the following form: 

(1) bM + Xn\Ml 

where M represents a candidate model. Dm is a measure of lack of fit by 
M and \M\ denotes the dimension of M, usually in terms of the number 
of estimated parameters under M. The main difference between procedures 
is made by A„, where n is the sample size. This is called a "penalizer," 
although some authors refer A^|iVf| as the penalizer. For example, Aj^ — 2 
for AIC. A number of similar criteria have since been proposed, including 
the Bayesian information criterion (BIC; Schwarz [26]) in which A^ = log(n), 
and a criterion due to Hannan and Quinn (HQ; Hannan and Quinn [10]) 
in which A^ = clog{log(n)} and c is a constant > 2. All these procedures 
can be viewed as special cases of the generalized information criterion (GIC; 
Nishii [21], Shibata [27]). A nice monograph on model selection from various 
perspectives is edited by Lahiri [18]. 

Although these criteria are widely used, difficulties are often encountered, 
especially in some nonconventional situations. A broad class of such noncon- 
ventional cases are mixed effects models, including linear and generalized lin- 
ear mixed models. For example, consider the following linear mixed model, 
Uij = x'^jP + Ui + Vj + eij , i = 1, . . . , mi, j = 1, . . . , m2, where Xij is a vector 
of known covariates, /3 is a vector of unknown regression coefficients (the 
fixed effects), Ui, vj are random effects and Cj-,- is an additional error. It is 
assumed that Uj's, vj's and e^j's are independent, and that for the moment, 
Ui ~ N{0,al), Vj ~ N{0,ay), Cij ~ N{0,al). It is well known (e.g.. Hartley 
and Rao [11], Harville [13], Miller [20]) that in this case, the effective sample 
size for estimating o"^ and o"^ is not the total sample size mi ■ m2, but mi 
and m2, respectively. Now suppose that one wishes to select the fixed covari- 
ates, which are components of Xij under the assumed model structure using 
BIC. It is not clear what should be in place of n in (1), where A^ = log(n) 
(it does not make sense to let n = mi ■ m2). In fact, in cases of correlated 
observations, such as the example here, the definition of "sample size" is 
often unclear. 

Furthermore, suppose that normality is not assumed in the above linear 
mixed model. In fact, the only distributional assumptions are that the ran- 
dom effects and errors are independent, and that they have means zero and 
variances a^, and fjg, respectively. Now suppose that one, again, wishes 
to select the fixed covariates using AIC, BIC or HQ. It is not clear how to 
do this because all three require the likelihood function in order to evaluate 
Dm. 
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In a way, model selection and estimation are two components of a process 
called model identification. While there is extensive literature on parameter 
estimation in linear and generalized linear mixed models, the other com- 
ponent, that is, mixed model selection, has received much less attention. 
Only recently have some results emerged in the area of linear mixed model 
selection. Datta and Lahiri [6] discussed a model selection method based on 
computation of the frequentist's Bayes factor in choosing between a fixed 
effects model and a random effects model. They focused on a one-way ran- 
dom effects model and noted a connection between the choice of fixed or 
random effects models and test of the hypothesis that the variance of the 
random effects is zero. Note that, however, not all model selection problems 
can be formulated as hypothesis testing. Jiang and Rao [16] developed vari- 
ous GICs suitable for linear mixed model selection and proved consistency of 
their procedures. Meza and Lahiri [19] demonstrated the limitations of Mal- 
lows' Cp statistic in selecting the fixed covariates in a nested error regression 
model which is a special case of the linear mixed models. They showed by 
simulation results that the Cp method without modification does not work 
well when the variance of the random effects is large; on the other hand, 
a modified Cp criterion obtained by adjusting the intra-cluster correlations 
performs similarly as the Cp in regression settings. Fabrizi and Lahiri [9] de- 
veloped a robust model selection method in the context of complex surveys. 
Another related paper is Vaida and Blanchard [28], in which the authors 
proposed a conditional AIC where the penalty term is related to the effec- 
tive degrees of freedom for a linear mixed model proposed by Hodges and 
Sargent [14]. 

It should be pointed out that all these studies are limited to linear mixed 
models, while model selection in generalized linear mixed models (GLMMs) 
has never been seriously addressed in the literature. It is well known that 
the likelihood function under a GLMM may involve high-dimensional in- 
tegrals which are difficult to evaluate, which makes a procedure based on 
(1) computationally unattractive. Furthermore, our simulation results sug- 
gested that in the case of GLMM selection, a GIC procedure is much more 
sensitive to the choice of A„ than in linear mixed model selection. 

In summary, the major concerns regarding the GIC procedures when ap- 
plied to mixed model selection are: (i) they depend on the effective sample 
size which is unclear in typical situations of mixed models; (ii) they rely on 
the likelihood function which may not be available; (iii) they do not seem 
applicable to GLMMs; and (iv) their finite sample performance may be sen- 
sitive to different choices of penalties. These motivate the development of a 
new procedure for mixed model selection, called the fence method, which we 
describe in detail in the next section. In Section 3, we propose two variations 
of the fence method. The first is a stepwise fence procedure; the second is 
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an adaptive fence procedure. In Section 4, we address the issue of consis- 
tency of different fence methods. In Section 5, we present some examples of 
simulations and real data analysis. Some concluding remarks are made in 
Section 6. Proofs of the main results are given in Section 7. 

2. The fence method. It is illustrative to first consider a simple example. 
Suppose that the observations yij satisfy the following linear mixed model, 

(2) Vij =XijP + ai + eij, 

i = I, . . . ,m, j = 1, . . . ,ni, where Xij is a vector of covariates, /3 is a vector 
of unknown regression coefficients, ai is a random effect and £ij is an error. 
It is assumed that the random effects and errors are independent such that 
E(aj) = 0, var(aj) = o"^, E(ejj) = and var(ejj) = r^. Even for this simple 
model, there are various model selection problems. For example, the selection 
of the fixed covariates; whether or not to include the random effects, etc. 
Our strategy is based on a quantity Qm = QMiy,SM), where y represents 
the vector of observations, M indicates a candidate model and 9m denotes 
the vector of parameters under M. It is required that ^{Qm) is minimized 
when M is a true model and 6m the true parameter vector under M. This 
means that Qm is a measure of lack-of-fit. Here by true model, we mean 
that M is a correct model, but not necessarily the most efficient one. For 
example, suppose that yij satisfy (2) with x'^jP = /3o + Pixuj + P2X2ij, where 
all the /3's are nonzero. Then for the problem of selecting the fixed covariates, 
this model is optimal in the sense that the number of fixed covariates cannot 
be further reduced. However, the model remains true if in (2) x[jP = /3o + 
Pixuj + l32X2ij + PaXij^j (with = 0). But the latter model is not optimal. 
On the other hand, the model with x[j(3 = Po + Pixuj in (2) is an incorrect 
model. In this paper, we use the terms "true model" and "correct model" 
interchangeably. Below are some options for Qm under different situations: 

1. Maximum likelihood (ML) model selection. If the normality assumption 
is made regarding the random effects and errors, an example of Qm is the 
negative of the log-likelihood under M. 

2. Mean and variance/covariance {MVC) model selection. Suppose that 
the situation is a bit more complicated. First, the errors are correlated within 
the clusters with some (parametric) covariance structure. Second, the nor- 
mality assumption is not made. In such a case, the likelihood function is not 
available. However, one may consider Qm = \ {T'V^j^T)~^T'V^j^ {y — hm)\'^ , 
where /xm and Vm are the mean vector and covariance matrix under M, and 
T is a (not necessarily square) matrix of full rank. Note that in this case, 
fiM = XmPm, where Xm is the matrix of covariates under M and Pm the 
vector of regression coefficients under M. Thus, such a Qm may be used to 
select the fixed covariates as well as the (parametric) covariance structure. 
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A special case of the MVC is least squares (LS) model selection, in which 
T = I, the identity matrix, hence Qm = \y — XmPm\'^- The latter is useful, 
for example, if only the fixed covariates are subject to selection while the 
covariance structure of the data is unknown. 

It is easy to verify that all the Qm above satisfy the basic requirement of 
lack-of-fit above. Other choices of Qm are considered in Jiang et al. [17]. 

2.1. Building the fence. Given a specific Qm, let Qm = Qhiiu-, Gm), where 
9m is the minimizer of Qm over 9m G @m, the parameter space under 
M, that is, Qm = iT^ie^,eeM QM{9M,y)- Let M e M he such that Q^:/ = 
minA,/GA4 Qm, where represents the set of candidate models. We assume 
that Ai contains a true model. Note that in many cases, M can be deter- 
mined without any calculation. For example, if Ai contains a full model, 
say Mf, that is, a model such that all other models in A4 are submodels of 
Mf , then clearly, M = Mf and, since A4 contains a true model, Mf is also a 
true model. In general, Ai may not contain a full model, but the following 
lemma shows that at least in large sample, M is expected to be a correct 
model. 

Lemma 1. Under Assumptions A1-A5 in Section 4, we have with prob- 
ability tending to one that M is a true model. 

The proof of Lemma 1 follows directly from that of Theorem 1 in the 
sequel. 

However, the main question is, "Are there other correct models with 
smaller dimension than M?" This is where the fence idea comes in. As 
mentioned, the idea is to construct a statistical barrier, called the fence, to 
carefully eliminate incorrect models. Then for the models within the fence 
which are considered correct, one may use whatever criterion of optimality 
to select the optimal model. In many cases, the criterion of optimality is 
minimal dimension of the model, but it may be replaced by some other con- 
siderations, or incorporate scientific or economical concerns. For example, 
in small area estimation (SAE, e.g., Rao [24]) a main problem of interest is 
the prediction of small area means. Thus, some measure of the prediction 
errors, such as the mean squared prediction error, should be taken into ac- 
count in selecting the optimal model within the fence. By the way, the linear 
mixed model (2), also known as the nested-error regression model (e.g., Bat- 
tese, Harter and Fuller [4]), has extensive applications in SAE. The fence is 
constructed through the following inequality: 

(3) Qm <QM + Cnd- j^j^j^j , 

where ^j^j^^ is an estimate of the standard deviation of Qm — Qm, denoted 
by cr.jjr^. It can be shown that the latter is an appropriate measure of 
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Qm — Qm ^ correct model M; while for an incorrect model, the difference 
is expected to be much larger. Furthermore, Cn denotes a tuning constant. 
For consistency of the model selection (see Section 4), it is required that c„ 
increase (slowly) with the sample size. Here consistency is in the sense that 
as the sample size increases, the probability that the procedure selects an 
optimal model approaches one. In Section 3.2, we show how to choose c„ 
adaptively in order to improve the finite sample performance. 

In case the minimal dimension criterion is used, an effective algorithm is 
outlined below. 

2.2. The fence algorithm. For simplicity, consider the case that M is 
unique. Let di < ^2 < • • • < be all the different dimensions of the models 
M £ M. We proceed as follows: 

(i) Consider Mi = {M £ M : \M\ = di and (3) holds}; if Mi ^ 0, stop 
(no need for any more computation). Let Mq £ Mi he such that Qmq = 
minMgXi Qm] Mq is the selected model. 

(ii) If Ml = 0, consider M2 = {M £ M : \M\ = ^2 and (3) holds}; if 
M2 7^ 0, stop. Let Mo £ M2 be such that Qmo = ™inMg_vi2 Qm', Mq is the 
selected model. 

(iii) Continue until the program stops (it will at some point). 

In short, the algorithm may be described as follows: Check the candidate 
models, from the simplest to the most complex. Once one has discovered a 
model that falls within the fence and checked all the other models of the 
same simplicity (for membership within the fence), one stops. 

2.3. Estimation of aj^j j^^. In some cases, this is straightforward. For ex- 
ample, suppose that the likelihood function is available, and Qm is chosen as 
the negative log-likelihood. Furthermore, suppose that Mf £ M. Then under 
some regularity conditions 2{Qm — Qm{) has an asymptotic Xd distribution 
with d = \M{\ - \M\. Thus, if M = Mf, we have aM,Mc ~ \/(|Mf| - |M|)/2. 

However, such an asymptotic distribution may not exist in general. 
Nevertheless, suppose that M* is true. Then in the case of clustered ob- 
servations one can approximate, under some regularity conditions, o'j^.f m* 
by var(QM — Qm*)- Furthermore, suppose that Q^f can be expressed as 
J2iLi QM,i, where QM,i = QM,i{yi,6M)- Then vaxiQu - Qm*) = EEEli(QM,i - 
QM*,if' — Z]i^i{E((5M,j) — E((5M*,i)}^]- Thus, an observed variance is ob- 
tained by removing the outside expectation and replacing the parameters 
and inside expectations by their estimators. See Jiang et al. [17] for more 
detail. The latter also considered several cases of nonclustered observations, 
including Gaussian mixed models, non-Gaussian linear mixed models and 
GLMMs. 
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3. Variations. In this section, we propose two variations of the fence. The 
first aims at making the fence procedure computationally more attractive. 
The second focuses on choosing the tuning constant Cn to improve the finite 
sample performance of the fence. 

3.1. A stepwise fence procedure. As mentioned, the fence has the com- 
putational advantage that it starts with the simplest models, and, therefore, 
may not need to search the entire model space in order to determine the op- 
timal model. On the other hand, such a procedure may still involve a lot of 
evaluations when the model space is large. For example, in quantitative trait 
loci mapping, variance components arising from the trait genes, polygenic 
and environmental effects are often used to model the covariance structure 
of the phenotypes given the identity by descent sharing matrix (e.g., Almasy 
and Blangero [3] ) . Such a model is usually complex due to the large number 
of putative trait loci. To make the fence procedure computationally more 
attractive to large and complex models, we propose the following variation 
of fence for situations of complex models with many predictors. 

To be more specific, consider the extended GLMMs introduced by Jiang 
and Zhang [15]. It is assumed that given a vector a of random effects, 
the responses yi, . . . ,y„ are conditionally independent, such that E(yj|a) = 
h{x^(3 + z'^a), I <i <n, where h{-) is a known function, /3 is a vector of 
unknown fixed effects and known vectors. Furthermore, it is as- 

sumed that a ~ A^(0, S), where the covariance matrix S depends on a vector 
ip of variance components. Let (3m and ipM denote P and tp under M, and 

1 /2 

9M,i{/3M,'4'M) = ^{hMix'iPM + z'i^M 0}' where Hm is the function h under 
M, 'Em is the covariance matrix under M evaluated at ipM, and the expec- 
tation is taken with respect to ~ N{0,lm) (which does not depend on M). 
Here m is the dimension of a and Im the m-dimensional identity matrix. 

Let Qm = J2i=i{yi - gM,i{PM,i^M)Y- 

Write X = (x9i<i<n and Z = {z'j)i<i<n- We assume that there is a col- 
lection of covariate vectors Xi , . . . , Xk , from which the columns of X are 
to be selected. Furthermore, we assume that there is a collection of matri- 
ces Zi,. . . ,Zl such that Za = J2seS ^sC(s, where S C {I, . . . , L}, and each 
as is a vector of i.i.d. random effects with mean and variance a^. The 
subset S is subject to selection. The parameters under an extended GLMM 
are the fixed effects and variances of the random effects. Note that in this 
case, the full model corresponding to Xp + Za = J2k=i ^kPk + J2b=i ^I'^i 
is among the candidate models. Thus, we let M = Mf . The idea is to use 
a forward-backward procedure to generate a sequence of candidate mod- 
els, among which the optimal model is selected using the fence method. We 
begin with a forward procedure. Let Mi be the model that minimizes Qm 
among all models with a single parameter; if Mi is within the fence, stop 
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the forward procedure; otherwise, let M2 be the model that minimizes Qm 
among all models that add one more parameter to Mi; if M2 is within the 
fence, stop the forward procedure; and so on. The forward procedure stops 
when the first model is discovered within the fence. The procedure is then 
followed by a backward elimination. Let be the final model of the for- 
ward procedure. If no submodel of Mk with one less parameter is within the 
fence, will be our selection; otherwise, is replaced by Mk+i which is 
a submodel of Mk with one less parameter and is within the fence, and so 
on. We call such a variation of fence the forward-backward (F-B) fence. 

3.2. Adaptive fence procedure. In this section, we address the issue re- 
garding choosing the tuning constant c„ involved in (3). According to The- 
orem 1 in the sequel, for consistency of the fence, one needs Cn —> 00 at a 
certain rate, but there are many c,„'s that satisfy this requirement. Also note 
that although for the consistency it is not required that dj^.^ be a consistent 
estimator of ^ as long as it has the right order (see Section 4) , there is 
always a constant involved which may make a difference in a finite sample 
situation. The problem can be solved by choosing a suitable Cn- 

We now introduce the idea of an adaptive procedure. Recall that Ai de- 
notes the set of candidate models, which includes a true model. To be more 
specific, we assume that there is a full model Aff G A^, hence M = Mf in 

(3) ; and that every model in M \ {A/f} is a submodel of a model in A4 
with one less parameter than Mf. Let M* denote a model with minimum 
dimension among M G Ai. First note that ideally, one wishes to select Cn 
that maximizes the probability of choosing the optimal model. Here for sim- 
plicity, the optimal model is defined as a true model that has the minimum 
dimension among all true models. This means that one wishes to choose Cn 
that maximizes 

(4) P = P(Mo = Mopt), 

where Mopt represents the optimal model and Mq = Mo(c,i) is the model 
selected by the fence procedure with the given Cji. However, two things are 
unknown on the right-hand side of (4): (i) under what distribution should 
the probability P be computed? and (ii) what is Mopt? 

To solve problem (i), note that the assumptions above on M imply that 
Mf is a true model. Therefore, it is possible to bootstrap under Mf. For 
example, one may estimate the parameters under Mf, then use a model- 
based bootstrap to draw samples under Mf. This allows us to approximate 
the probability distribution P on the right side of (4). 

To solve problem (ii), we use the idea of maximum likelihood. Namely, let 
p*(M) = P*(Mo = M), where M £ M and P* denotes the empirical proba- 
bility obtained by bootstrapping. In other words, p*{M) is the sample pro- 
portion of times out of the total number of bootstrap samples that model M 
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is selected by the fence method with the given Cn- Let p* = maxA/gA^ p* (M) . 
Note that p* depends on Cn ■ The idea is to choose Cn that maximizes p* . It 
should be kept in mind that the maximization is not without restriction. To 
see this, note that if = then p* = 1 (because when c„ = the procedure 
always chooses Mf). Similarly, p* = \ for very large c„, if is unique (be- 
cause when Cn is large enough the procedure always chooses M*). Therefore, 
what one looks for is "the peak in the middle" of the plot of p* against c„ . 

Here is another look at the method. Typically, the optimal model is the 
model from which the data is generated, then this model should be the most 
likely given the data. Thus, given c„, one is looking for the model (using the 
fence procedure) that is most supported by the data or, in other words, one 
that has the highest (posterior) probability. The latter is estimated by boot- 
strapping. Note that although the bootstrap samples are generated under 
Mf, they are almost the same as those generated under the optimal model. 
This is because the estimates corresponding to the zero parameters are ex- 
pected to be close to zero, provided that the parameter estimators under 
Mf are consistent. (Note that in some special cases, a nonmodel based boot- 
strap algorithm can also be used. For instance, in the case of crossed random 
effects, Owen [22] presents a pigeonhole bootstrap algorithm which can be 
used effectively.) One then pulls off the c„ that maximizes the (posterior) 
probability and this is the optimal choice, denoted by c* . 

A few technical issues deserve some attention: 

1. Quite often the search for the peak in the middle finds multiple peaks 
(see Figure 1). In such cases, one should pick the highest. This is supported 
by our theoretical result, namely. Theorem 3 in the sequel which shows 
that p*(c*) ^ 1 in probability as n — > oo. It is also very common to have 
interval(s) of c„ at which p* is at the maximum, say, p* = 1. We then take the 
median of each interval, and let c* be the smallest of those medians, if there 
are more than one. The latest strategy is called conservative. For example, 
in the case of variable selection this strategy intends to make sure that no 
important variable is missing (in other words, to minimize the probability 
of underfit). 

2. There are two extreme cases which occur when the optimal model is 
either the full model, Mf, or the minimum model, M^. It should be pointed 
out that these cases are rare in practice. For example, in most cases of 
variable selection, there are a set of candidate variables and only some of 
them are important. This means that the optimal model is neither the full 
model nor the minimum model. Furthermore, when the extreme cases do 
occur, they are often easy to identify from the plot of p* (see Figure 1). 
Alternatively, one can run screen tests for the extreme cases. Such tests are 
recommended as supplementary tools to the inspection of the plot. 

The first screen test is called full model test. The idea is the following. 
Define A^f_i as the set of all models with one less parameter than Mf. 
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Fig. 1. Upper left: p* versus Cn without adjusting the baseline under Model 4- Upper 
right: p* versus without adjusting the baseline under Model 5. Lower left: p* versus Cn 
with adjusting the baseline under Model 5. Lower right: p* versus Cn with adjusting the 
baseline under Model 1. All plots are made based on 100 bootstrap samples generated given 
the first simulated dataset. 



Suppose that when Mf is the optimal model, we have E{Qm — Qmi) ~ clu, 
VM G A^f-i- Here ti„ ~ Vn means that both Un/vn and Vn/un are bounded. 
On the other hand, if Mf is not optimal, there is M G which is a 

true model, hence E((5a/ — Omj) = 0{bn), where bn = o{an)- It follows that 
minMeMi-i E(<5a/ — Qui) = 0{hn). Therefore, we consider 

{minMGA4f_iE((5M-<9A/f)}^ 
(5) qn = —7 ■ 

In practice, g„ is replaced by its bootstrap estimate, q'*, obtained as above. 
If (7* < 1, the full model test passes; otherwise, the full model test fails, 
in which case we assign cj^ = 0. The second screen test is called minimum 
model test. For simplicity, we assume that there is a unique M^, £ M that 
has the minimum dimension. Suppose that E{Qm^ — Qut) = 0{gn) if M^, 
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is incorrect; and the order becomes 0{hn) if is correct (hence optimal), 
where /i„ = o[gn). We then consider 
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Let r* be the bootstrap version of r^. If r* > 1, the minimum model test 
passes; otherwise, the minimum model test fails, in which case we assign c* 
as the upper bound of a sequence of values considered (see below). 

One concern about the screen tests is that the quantities a„, 6„, hn 
may subject to scale change. Throughout this paper, we choose those quan- 
tities naturally without additional constants. For example, if = 0{n), we 
simply take an = n (not 2n or 3n). On the other hand, the minimum model 
test can be replaced by the following threshhold checking which does not 
suffer from the scale change. Assuming that M* is true (therefore, optimal), 
one can draw bootstrap samples y*^, b = 1, . . . , B under M^. Then based on 

such bootstrap samples, compute = maxi<i,<B{Q*M^ b ~ Q*M* foi' where 

Mj* is defined below. If Qm, — Qm* > d^, do not consider the right tail 
of the plot of p* against Cn that goes up and stays at one (see Figure 1); 
otherwise, consider it. Unfortunately, the same idea does not apply to the 
full model case. To see why, note that the threshhold checking is similar 
to hypothesis testing. In the minimum model case, the null hypothesis is 
that M^: is true, therefore, one can draw bootstrap samples under the null. 
However, in the full model case, the null hypothesis is that Mf is optimal, 
which is equivalent to that none of the models in Ai /_i [defined above (5)] is 
true. We do not know how to draw bootstrap samples under such a null. To 
solve this problem, we use a method called adjusting the baseline. Consider, 
for simplicity, the problem of selecting the fixed covariates under a linear 
mixed model. Suppose that the candidate variables are Xi, . . . , Xs- Create 
an additional variable that is unrelated to the data, for example, by gener- 
ating a random vector X*_^i whose components are i.i.d. ~ N{0, 1) and are 
independent of the data. Define the model Mj* as the model that includes 
Xi,. . . ,Xs,X*_^_i. Then replace QMf in the fence inequality by Qm*- Note 
that even though the baseline is adjusted, is not considered as a candi- 
date model (because we know it is not optimal). Note that after the baseline 
change, p* will not equal to one when c„ = (see Figure 1). Although the 
standard normal distribution is used to adjust the baseline, our simulation 
results (see Section 5.1) suggest that the method is quite stable with respect 
to different choices of baselines. 



Remark. In practice, if there is belief that M^, and Mf are unlikely 
to be the optimal model, neither the screen tests nor the baseline adjust- 
ment/threshhold checking are necessary. 
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3. Finally, one needs to determine at which values of c„ to evaluate p* . 
Theoretically, the range of c„ is [0,oo), but practically one needs an up- 
per bound. This can be determined as follows. Note that any c„ > i? = 
(Om* — QMi)/o'M,,Mi makes no difference to the fence procedure (assuming 
no baseline adjustment). This is because then (3) is satisfied by M^^, hence 
Mq = Mtf. Therefore, we choose the upper bound of Cn as B* = [B\ + 1. We 
then divide the interval [0,i?*] by subintervals of equal length and consider 
the end points. 

Remark. It turns out that requiring the existence of a full model or 
other known true model from which to draw bootstrap samples is not much 
of a practical problem, because in essence the adaptive fence can be done 
in two steps. In the first step, one could use the fence with a fixed c„ (e.g., 
Cn = 1) to select a true model (which may not be optimal). Then in the 
second step, one applies the adaptive fence procedure with bootstrap samples 
drawn under the true model selected in the first step. Note that in the first 
step, one does not need Cn to increase in order to select (with probability 
tending to one) a true model. 

4. Consistency of fence, F-B fence and adaptive fence. We assume that 
the following Assumptions A1-A4 hold for each M S A^, where Om repre- 
sents a parameter vector at which E(Qm) attains its minimum, and OQm /dGM^ 
and so forth, represent derivatives evaluated at 9m ■ Similarly, OQm /QOm, 
and so forth, represent derivatives evaluated at Om- 

Assumption A1 
respect to 9m] and 

(7) 

Assumption A2. There is a constant Em such that Qm{9m) > Qm{9m), 
if I^mI >5m. 

Assumption A3. The equation OQm /d9M = has an unique solution. 

Assumption A4. There is a sequence of positive numbers a„ — > oo 
and < 7 < 1 such that the following hold: dQM/d9M — ^{dQM/d9M) = 
Op{al), d^QM/d9Md9'j,j-Eid^QM/deMd9'j,j) = Ov{al), liminf a-U„,in{E x 
{d'^QAi/dOMdd'j^)} > 0, limsupa;^Umax{E((9^QM/56'M56'^./)} < oo, and there 
is 5m > such that sup^g^^_g^^^^g^^ \d^QM/d9M,jd9M,kd9M,i\ = Op(an), 1 < 
i, k, 1<PM, where pM = dim(6'A./). 



Qm is three-times continuously differentiable with 



FENCE METHODS FOR MIXED MODEL SELECTION 



13 



In addition, we assume the following. Recall that Cn is the constant in 
(3). 

Assumption A5. c„ — > oo; V true model M* and incorrect model M, we 
have E{Qm) > 'E{Qm*), \immi{aMM* f^^'^?'^) > and CnCTM^M* /{E{Qm) - 
E(Qm*)}^0. 

Assumption A6. aM,M* > ^^^^ ^m,m* = o-m,m*Op{1) if M* is true 
and M incorrect; and (Tm,m* V a'^~^ = ^m,m*0p{1) if both M and M* are 
true. 

Note. (7) is satisfied if E{Qm) can be differentiated inside the expecta- 
tion. Assumption A2 implies that \9m\ ^ Bm- To illustrate Assumptions A4 
and A5, consider the case of clustered responses (see the last paragraph 
of Section 2.3). Then under regularity conditions, Assumption A4 holds 
with an = rri and 7 = 1/2. Furthermore, we have crM,M* = 0{y/rn) and 
E(Qm) — E{Qm*) = 0{m), provided that M* is true, M is incorrect and 
some regularity conditions hold. Thus, Assumption A5 holds with 7 = 1/2 
and Cn being any sequence satisfying — > 00 and Cn/\/rn^ 0. Finally, 
Assumption A6 does not require that aM,M* be a consistent estimator of 
<^M,M* — o'^^y t^aX it has the same order as aM,M*- 

Lemma 2. Under Assumptions A1-A4, we have 6m — = Op{a'J^'^) 
and Qm - Qm = Op{al^~^). 

Let Mq be the model selected by fence using (3). The following theorem 
establishes consistency of the fence procedure. 

Theorem 1. Under Assumptions A1-A6, we have with probability tend- 
ing to one that Mq is a true model with minimum dimension. 

The proofs of Lemma 2 and Theorem 1 are given in Sections 7.1 and 7.2, 
respectively. 

The next theorem establishes consistency of the F-B fence proposed in 
Section 3.1. Note that the method is introduced in the case of extended 
GLMMs. Let Mq be the final model selected by the F-B fence procedure 
using (3). 

Theorem 2. Under Assumptions A1-A6, we have with probability tend- 
ing to one that Mq is a true model and no proper submodel of Mq is a true 
model. 
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Note that here consistency is in the sense that with probabihty tending 
to one, Mq is a true model which cannot be further reduced or simphfied. 
The proof is given in Section 7.3. 

Finally, we give sufficient conditions for the consistency of the adap- 
tive fence procedure introduced in Section 3.2. For simplicity, assume that 
Mopt is unique. Consider the ratios vm = {Qm — Qm{) / ^M,Mf M £ Ai. Let 
A^w< denote the subset of incorrect models with dimension < | Mopt I ■ Write 
''opt = A/opt and r„< = minA/ex„< ta/. Denote the c.d.f.s of ropt and r„< 
by Fapt and F^<, respectively. Let Mq{x) be the model selected by the fence 
procedure using (3) with c„ = x, and P{x) = F{Mq{x) = Mopt). Let P*{x) 
be the bootstrap version of P{x). Denote the bootstrap sample size by n* . 
Recall the definitions of a„, 6„, g„, in (5), hn, rn, r* in (6), and B* 
above the final remark of Section 3.2. We make the following assumptions. 

Assumption A7 {Asymptotic distributional separation). If Mopt ^ {Mf, 
M*}, then for any e > 0, there is < < 0.1, Xn^i < Xn,2 < Xn,3, and > 1 
such that when n'> N the following hold: -Fopt(a^n,i) > 1 — e, F„<:{xn,3) < s, 
P{xn,2) > 1 - (5, 1 - 4(5 < P{xn,j) < 1 - 3(5, J = 1,3; if Mopt = Mf , we have 
P{mmMeM,M^Mf Qm > Qmi) ^ 1 as n ^ oo. 

Assumption A8 {Good bootstrap approximation). If Mopt ^ {Mf,M*}, 
then for any J, 77 > 0, there are > 1, A^* = N*{n) such that, when n> N 
and n*>N*, we have P(sup3.>o \P*ix) - P{x)\ < (5) > 1 - ??; if Mopt = Mf , 
we have qn/Qn = Op(l); if Mopt = M*, we have Qn/qn = Op(l) and r*/r„ = 
Op(l). 

For the most part, Assumption A7 says that there is an asymptotic sepa- 
ration between the optimal model and the incorrect ones that matter in that 
the peak of P{x) is distant from the area where r^< concentrates. This is 
reasonable because, typically, ropt is of lower order than r^<. Therefore, one 
can find an interval, {xn,i,Xn,3), such that (3) is almost always satisfied by 
M = Mopt when c„ S {xn,i,Xn,3). On the other hand, {xn,i,Xn,3) is distant 
from the area where rw< concentrates, so that ropt < Cn, rw< > Cn with high 
probability, if c„ G {xn,i,Xn,3)- Thus, P{x) is expected to peak in {xn,i,Xn,3) 
while F^<{x) stays low in the region. 

Recall that p* in the adaptive procedure is a function of c„, that is, 
p* =p*{cn). The following theorem establishes consistency of the adaptive 
fence. The proof is given in Section 7.4. 

Theorem 3. Under Assumptions A7 and A8, the following hold: 

(i) If Mopt ^ {Mf, A/*}, then with probability tending to one there is c* G 
(0, 00) which is at least a local maximum and approximate global maximum 
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of p* in the sense that for any 5,r] > 0, there is N >l and N* = N*{n) such 
that P(p*(c*) >l-S)>l-r], ifn>N and n* > N* . 
(ii) In general, define c* as 



the c* in (i), 
1, 

Let Mq be the model selected by the fence procedure using (3) with M = Mf 
and Cn replaced by c* . Then Mq is consistent in the sense that for any rj > 
there is N >1, N* = N*{n) such that F{M^ = M^pt) > 1 - ^, ifn>N and 
n* > N*. 

5. Examples of simulations and data analysis. 

5.1. The Fay-Herriot model — an illustration of adaptive fence method. 
The Fay-Herriot model is widely used in small area estimation. It was first 
proposed to estimate the per-capita income of small places with population 
less than 1,000 (Fay and Herriot [8]). The model can expressed as yi = x[f3 + 
Vi + Ci, i = 1, . . . ,m, where Xi is a vector of known covariates, /3 is a vector 
of unknown regression coefficients, fj's are area-specific random effects and 
Cj's represent sampling errors. It is assumed that independent 
with Vi ~ A^(0,^) and Cj ~ N{0,Di). The variance A is unknown, but the 
sampling variances Dj's are assumed known. 

Let X = {x'^i<i<rn, so that the model can be expressed as y = XP + v + e, 
where y = (yi)i<j<m, v = (■Ui)i<i<m and e = (ej)i<j<m- The first column of 
X is assumed to be Im which corresponds to the intercept. The rest of the 
columns of X are to be selected from a set of candidate covariate vectors 
X2, . . . ,Xk, which include the true covariate vectors. First note that by ap- 
plying the following transformation, we can simplify the problem to the case 
Di = l. Let D = l + maxi<j<fri -Dj . Draw independent samples ui, . . . ,Um 
independent with the Uj's and e^'s such that Ui ~ N{0,D — Di), 1 < i < m. 
Then let iji = {yi + Ui)lyrD, Xi = Xi/y/D, Vi = Vij \fD and Cj = {ci + Uj)/ \fD. 
Consider y^'s as the new observations. Then we have yi = x[[3 + vi + Ci, 
i = 1, . . . ,m, where Vi, Cj, i = 1, . . . ,m, are independent with Vi ~ A^(0, A), 
A = A/D and ej ~ N{0, 1). Thus, without loss of generality, we let Di = 1, 
1 <i <m. 

Consider the fence ML model selection (see Section 2). It is easy to 
show that in this case, Qm = ("T'/2){1 + log(27r) + log{\Px-i-y\'^ /m)} , where 
Px-L = Im ~ Px and Px = X{X' X)~^X' . We assume for simplicity that 
X is of full rank. Then Qm - Qm, = im/2)log{\Px±y\^ /\Px^^y\^)- Further- 
more, it can be shown that when M is a true model, we have Qm — Qm{ = 



ifq*n>^, 

i/< < l,r* < 1, 

2/ 9^ < 1, "rj^ > 1 and such a c* exists, 
otherwise. 
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Table 1 

Fence methods with different c„'s in the Fay-Herriot model 



Optimal model 


1 


2 


3 


4 


5 


Adaptive c„ (ST) 


100 


100 


100 


99 


100 


Adaptive c„ (B/T) 


99 


100 


100 


99 


100 


c„ =loglog(n) 


52 


63 


70 


83 


100 


c„ = log(n) 


96 


98 


99 


96 


100 




100 


100 


100 


100 


100 


c„ = n/log(n) 


100 


91 


95 


90 


100 


c„ = n/loglog(n) 


100 
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(m/2)log(l + ^_^_i F), where p + 1 is the number of columns of X, and 
F ~ FK-p,m-K-i- Therefore, (Jm,M{ is completely known given \M\ and can 
be evaluated accurately (e.g., by numerical integration). 

We carry out a simulation study to evaluate the performance of the adap- 
tive method. Here we consider the adaptive method assisted either by the 
screen tests (ST) or the baseline adjustment /threshhold checking (B/T). 
We consider a (relatively) small sample situation with m = 30. With = 5, 
X2,...,X5 were generated from the A^(0, 1) distribution, and then fixed 
throughout the simulations. The candidate models include all possible mod- 
els with at least an intercept (thus, there are 2^ = 16 candidate models). 
We consider five cases in which the data y is generated from the model 
y = Ei=i/5i^i + ^ + e, where /?' = (A, . . . , /^s) = (1, 0, 0, 0, 0), (1,2,0,0,0), 
(1,2,3,0,0), (1,2,3,2,0) and (1,2,3,2,3), denoted by Models 1, 2, 3, 4, 5, 
respectively. The true value of ^ is 1 in all cases. The number of bootstrap 
samples for the evaluation of the p*'s is 100. 

In addition to the adaptive method, we consider five different (nonadap- 
tive) c„'s (n = m in this case), which satisfy the consistency requirements 
given in Theorem 1 (note that these requirements reduce to Cn — > oo and 
Cn/n — > in this case). These are Cn = loglog(n), log(n), y/n, n/log(n) and 
n/loglog(n). Reported in Table 1 are percentage of times, out of 100 simu- 
lations that the optimal model was selected by each method. 

Summary. Although the reported results for Adaptive c„ (B/T) were 
obtained using A^(0, 1) for the baseline adjustment, the same simulations 
were carried out when A^(0, 1) is replaced by Uniform [0, 1], Poisson(l) and 
Bernoulli distributions. The only (slight) differences in the results are those 
under Model 1, which are 99, 98 and 100, respectively, for Uniform[0, 1], 
Poisson(l) and Bernoulli. This suggests that the method is not very sensitive 
to different choices of baselines which is what one desires. Figure 1 displays 
the plots oip* against c„ in a number of situations. Furthermore, we explore 
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the two-step adaptive fence procedure (with ST) described in the last remark 
of Section 3.2 and the same results were obtained. 

It seems that performance of the fence with Cn = log(n), y/n or n/log(n) 
is fairly close to that of the adaptive fence. In any particular situation, one 
might get lucky to find a good c„ value by chance, but one cannot be lucky 
all the time. In fact, for more complicated mixed models, the definition of 
the sample size may not simply be the total number of observations or the 
number of clusters so, for example, something like log(n) or y/n may not 
make sense. 

Computational note. The simulations of this subsection were run on 
a Pentium Dual Core CPU 3.2 GHz, memory 4 GB, Harddrive 500 GB. The 
times it took to run the first simulation of Adaptive c„ (B/T) under Models 
1-5 were 1.7 sec, 3.0 sec, 4.1 sec, 4.4 sec. and 5.3 sec, respectively. 

5.2. Linear mixed models for clustered data. We consider the follow- 
ing linear mixed model (see Jiang and Rao [16]), Uij = x'^jP + ai + Eij, 
i = 1, . . . ,m, j = 1, . . . , K , where Xij is a vector of covariates and (3 a vec- 
tor of unknown regression coefficients (the fixed effects) . The random effects 
ai, . . . , am, are generated independently from A^(0, cr^). The errors are gener- 
ated so that £i = {£ij)i<j<K, i = 1, ■ . ■ ,m, are independent and multivariate 
normal with Var(ej) = t^{(1 — p)I + pJ}, where / is the identity matrix and 
J matrix of I's. Finally, the random effects are uncorrelated with the errors. 

Now pretend that the covariance matrix of the data is unknown. The 
problem is to select the fixed covariates. Write the model as y = XP + Za + £. 
The candidate covariates which are columns of X are Xi , . . . , Xr, , where Xi 
is a vector of I's and X2, ■ ■ ■ ,^4 are generated randomly from the A^(0, 1) 
distribution, and then fixed throughout the simulations. We consider the 
Qm for LS model selection (described above Section 2.1) which is suitable 
for this situation. 

We examine the performance of fence with fixed = 1.1 and that of the 
adaptive fence. As comparison, two GICs developed in Jiang and Rao [16]) 
are considered, which are similar to (1) for this problem: (i) = 2 which 
corresponds to the Cp method; and (ii) = logn, where n = mK, which 
corresponds to the BIG. The latter choice satisfies the conditions of Theorem 
1 in Jiang and Rao [16] for consistent model selection for this setting. 

We consider the case where the errors have varying degrees of exchange- 
able structure. Four values of p were considered: 0,0.2,0.5,0.8. The random 
effects and errors were simulated from normal distributions with a = t = 1. 
We set the number of clusters to be m = 100 and the number of observations 
within a cluster to he K = 5. Three (true) /3's are considered: (2,0,0,4,0), 
(2,9,0,4,8) and (1,2,3,2,3). A total of 100 realizations of each simulation 
were run. 
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Summary. The results reported in Table 2 for adaptive fence are those 
under B/T (see the previous subsection). The same results were obtained 
under ST. The fence method with fixed c„ is seen to have robust selection 
performance in most situations considered. In cases where the true model 
was relatively small in dimension, the fence method suffers some from over- 
fitting. The overfitting proneness in these few situations is less than that 
found when using Cp but more than that found when using BIC. Selection 
performance in the second situation with a larger true model with high signal 
is solid for the fence method. However, in the last situation with the opti- 
mal model being the full model with all weak covariates, both BIC and Cp 
tend to underfit. The fence method still shines having excellent performance 
with comparatively little or no underfitting empirically observed (note that 
overfitting is not possible in this situation). The effect of increasing correla- 
tion in the errors (i.e., clustering) is to act as a means of reducing effective 
sample size. The end result is that as the correlation between observations 
within a cluster increases, selection performance for all fixed penalization 
methods degrades somewhat. The adaptive fence on the other hand shines 
in all situations giving 100% selection accuracy. This clearly demonstrates 
the effectiveness of the adaptive fence method (at a computational cost, of 
course). 

5.3. Prenatal care for pregnancy. This real-data example is an applica- 
tion of the F-B fence procedure to GLMMs (see Section 3.1). Rodriguez and 
Goldman [25] considered a dataset from a survey conducted in Guatemala 

Table 2 

Simulation results: linear mixed model selection. Reported are probabilities of correct 

selection (underfittmg, overfitting) as percentages estimated empirically from 100 
realizations of the simulation. Cp and BIC results for Models 1 and 2 were taken from 

Jiang and Rao [16] 



Optimal model 


P 


Cp 


BIC 


Fence (c„ = 1.1) 


Adaptive 


fer 


= (2,0,0,4,0) 





64(0, 36) 


97(0, 3) 


94(0, 6) 


100(0, 


0) 




0.2 


57(0, 43) 


94(0, 6) 


91(0, 9) 


100(0, 


0) 




0.5 


58(0, 42) 


96(1, 3) 


86(0, 14) 


100(0, 


0) 




0.8 


61(0, 39) 


96(0, 4) 


72(0, 28) 


100(0, 


0) 


= (2,9,0,4,8) 





87(0, 13) 


99(0, 1) 


100(0, 0) 


100(0, 


0) 




0.2 


87(0, 13) 


99(0, 1) 


100(0, 0) 


100(0, 


0) 




0.5 


80(0, 20) 


99(0, 1) 


99(0, 1) 


100(0, 


0) 




0.8 


78(1, 21) 


96(1, 3) 


94(0, 6) 


100(0, 


0) 


/?' = (1,2,3,2,3) 





85(15, 0) 


81(19, 0) 


100(0, 0) 


100(0, 


0) 




0.2 


79(21, 0) 


73(27, 0) 


100(0, 0) 


100(0, 


0) 




0.5 


74(26, 0) 


64(36, 0) 


97(3, 0) 


100(0, 


0) 




0.8 


44(56, 0) 


26(74, 0) 


94(6, 0) 


100(0, 


0) 
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regarding the use of modern prenatal care for pregnancies where some form 
of care was used (Pebley [23]). While Rodriguez and Goldman focused on 
assessing the performance of the approximation method that they developed 
in fitting a three-level variance component logistic model, we consider ap- 
plying the fence method in selection of the fixed covariates in the variance 
component logistic model. The models are described as follows. 

Suppose that given the random effects at community levels Uj, 1 <i <m 
and random effects at family levels Vij, 1 < i < m, 1 < j < n^, binary re- 
sponses i/ijk, 1 < i < in, I < j < Ui, 1 < k < conditionally indepen- 
dent with TTijk = E{yijk\u,v) = F{yijk = l\u,v). Furthermore, suppose that 
the random effects are independent with Ui ~ N{0,a^) and Vij ~ iV(0,r^). 
The following models for the conditional means are considered such that 
under model M, logit(7rjjfc) = X'j^ -^^^Pm + Ui + Vij, where XM,ijk is a sub- 
vector of the full set of fixed covariates and Pm the corresponding vector of 
regression coefficients. 

Let ip = {a'^,T'^y. The vector of parameters under model M is 9m = 
{(3'j^j,iIj')' . We use the Qm introduced earlier for extended GLMMs (see the 
second paragraph of Section 3.1). An estimated cr^Af* "^^^ ^e obtained us- 
ing the idea of observed variance (see Section 2.3, and Jiang et al. [17] for 
detail). The expectations involved in Qm are evaluated by numerical inte- 
gration. Since the number of covariates considered is quite large, to keep the 
computational time manageable, we apply the F-B fence procedure intro- 
duced in Section 3.1 with Cn = l- 

The data analysis has selected the following variables (in the order that 
they were selected in the forward procedure): Proportion indigenous (1981), 
Modern toilet in household. Husband's education secondary or better. Hus- 
band's education primary. Television watched daily. Distance to nearest 
clinic. Mother's education primary. Television not watched daily. Mother's 
education secondary or better. Indigenous (no Spanish), Indigenous (Span- 
ish), Mother age. Husband agriculture employee. Husband agriculture self- 
employee. Child age. Birth order 4-6 and Husband's education missing. 
There are some interesting differences between the fixed effects discovered 
by the fence versus those found by standard maximum likelihood analysis 
using a 5% significance level as reported in Rodriguez and Goldman [25]. 
First, Husband's education overall (primary or higher relative to the refer- 
ence group of no education for the husband) was found to be an important 
predictor whereas Rodriguez and Goldman found that only Husband's sec- 
ondary education was important. Our more uniform finding is also in line 
with the finding for Mother's education. The implication is that education of 
some kind is important for both the mother and husband to have. A similar 
kind of finding was observed for variables corresponding to husband's profes- 
sion. We found that regardless of what type of agricultural employment the 
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husband had, it was an important predictor overah. Rodriguez and Gold- 
man report that only nonself employed agricultural jobs for the husband 
mattered. The fence method also uniquely found that watching television 
(daily or not) was an important predictor. This can be intuitively justified 
since it provides a medium for women to learn more about modern prenatal 
health care methods, and thus make it more likely for them to choose to 
use such methods. Other findings were in line with those of Rodriguez and 
Goldman [25]. 

6. Concluding remarks. Fence is different from procedures like AIC, BIG 
in that there is no criterion function that is minimized. In other words, in- 
stead of trying to find an "optimal" model that minimizes a criterion func- 
tion, fence proposes to carry out the optimization by two steps. The first 
step is to identify the set of true models (the ones that are in the fence) 
or, in case a true model does not exist, the models that best approximate 
the real-life problem. Note that although in this paper we have assumed the 
existence of a true model, the method can be easily extended to the situ- 
ation where a true model does not exist, or is understood as the one that 
provides the best approximation. On the other hand, the second step of 
fence, which identifies the model with minimal dimension within the fence, 
is quite flexible. For example, the dimension of a model may not be defined 
as the number of estimated parameters (e.g., Hastie and Tibshirani [12], Ye 
[29]); or it may be replaced by some other considerations, such as economical 
concerns. In fact, practically speaking, optimality in model selection usually 
goes beyond statistics. Keeping this in mind, it appears that the fence pro- 
cedure is easier to incorporate with other scientific or economical criteria 
than minimizing a single criterion function determined before the scientific 
or economic problem. 

A good feature of the fence algorithm is that it needs not search over all 
the candidate models in order to find the optimal model. 

In this paper, we have demonstrated the robust performance of fence in 
linear and generalized linear mixed model selection. In addition, we have in- 
troduced a stepwise fence procedure to handle situations of large number of 
predictors. Furthermore, we have proposed an adaptive procedure for choos- 
ing a tuning constant involved in the fence method. The adaptive procedure 
improves the finite sample performance of fence at a computational cost. On 
the theoretical side, we have established consistency of the different fence 
procedures, with the proofs given in the next section. 
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7. Proofs. 



7.1. Proof of Lemma 2. Assumptions A2 and A3 imply that 6m is the 
unique solution to dQM/dOM = 0. By Taylor expansion, we have 

Qm — Qm 



V 89 M 



(Oai 



M 



d6M,j d9M,k 99 



M,l 



{9 m — 9 m) 

{(^M,j — 9M,j){9M,k — (^M,k){(^M,l 



h + \h + ]:h 
I o 



for any 9 m-, where d'^Q\^l ■•■ represents the third derivatives evaluated at 
9\;i^ which lies between 9m and 9m- For any e > 0, by Assumptions Al and 
A4, there are 5 > and A^o > 1 such that \raxr,{^{d QMld9Md9\i)\ > 5a„,, 
n ^ -^0, and Li > such that the probability is greater than 1 — e that 
\dQM/d9M\ < Lial, 



max 



d9Md9'^, 



sup 

\9m~9]^j\<Sm 



E 



d9Md9'j,, 



M 



d9M,j d9M,k 99 



MA 



Now choose -L2 > such that 5L2 > 2Li. Let &m,L2 = {^A/ : \9m — 9m\ < 
L2a1~^}, and Qm,L2 be the boundary of @m,L2, that is, Qmm = {^M : — 
9m\ = -^2an~^J- Then choose A^i > 1 such that L2a^~^ <&m-, n> Ni. It fol- 
lows that for 9 € 9m,L2; we have < LiL2a^~^ , I2 > 6L2a'^~^ — LiL^ 

Qm — Qm 



nil, 



2^n 



j|)3<LiLip^2^37-2 whence 



(8) 



> lL2al'"^{6L2 - 2Li - LiL2(l + ^L2p'j/.j 



3/2. 



\/9 G Qm,L2- If choose A''2 > 1 such that, when n > N2, the quantity inside 
{• • •} on the right-hand side of (8) is positive, and let N = NqV NiV N2, then 
we have with probability greater than 1 — e, that Qm > Qm, V0 G Qm,L2- 
It follows that F{\9m - 9m\ < L2al-^) > 1 - e, a n > N . This proves that 
9M-9M = Op{ar^). 

By similar arguments, it can be shown that for any e > 0, there are con- 
stants L, Li, L2 and N >1 such that, when n> N, 

27-1 , 1 r r2^27-l , 1 r r2^37-2 , Ir r3 3/2 3^-2 



Qm - Qm < LiL2a,^ 
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< L2{Li + 1(L + Li)L2 + lLiLlp%^}al^-' 
with probability > 1 — e. This proves that Qm — Qm = Op{a^~^). 

7.2. Proof of Theorem 1. For the most part, we show that with proba- 
bihty tending to one (w.p. — > 1), all the true models (with \M\ < |M|) are 
in the fence, and all the incorrect ones are out. 

Let M be an incorrect model and M* a true model. By Lemma 2 and As- 
sumption A5, we have Qm - Qm* = Qm - Qm* + Qm - Qm - 
{Qm* - Qm* ) = Qm- Qm* + Op (4^"' ) = E(Qm) - ^{Qm* ) + {Qm- Qm* - 
E(gM-QA/*)} + Op(an^~')=E(QM)-E(QA,/*) + ^A./,M*Op(l) = {E(QM)- 
E((5j\/*)}{1 + op(l)}. It follows that, w.p. — > 1, we have Qm > Qm*- This 
implies that w.p. — > 1, M is a true model (because an incorrect model cannot 
be the minimizer). 

Furthermore, it is seen from this argument that if M is incorrect, we have 

Qm — Qm* 



-){l + op(l)}-i 



(9) = CnCFMM* j^tn \ FCn Wrr 

Li^VVAf j — i^y^M*) XO^MM* 

Assumptions A5 and A6 imply that the quantity inside [• • •] in (9) is op(l). 
Therefore, w.p. — > 1, we have Qm > Qm* + Cn^M,M*- follows that 

P(|M| < |M|,Mg7W_) 

< ^{Qm < Qm + Cn^M,M) 

< J2 ^(.Q^i ^ Qm* + c„(7Af,A/*,^^ = M*) + P(M is incorrect) 

M* is true 

< ^ P(<5a/ < Qm* + Cn^M,M*) + P(-/W" is incorrect) 0. 

M* is true 

Let El = Hm is incorrect,|M|<|M|{^'-^ ^ -^^i' t^CU El = Um is incorrect { I ^ I < 

\M\,M £ M-}, hence P{Ef) 0. This proves the "out" part. 

On the other hand, if M and M* are both true models, then by the 
property of Qm, we have E{Qm) = ^{Qm*)- Therefore, by similar argu- 
ments and Assumption A6, we have Qm — Qm* = Qm — Qm*+ Op(a'n'~^) = 
<5"M,M*0p(l)- Since c„ oo, we have, w.p. 1, Qm < Qm* +Cnd-M,M*- I* 
follows that 

P(|M| < \M\,M^M-) 

< P{Qm > Qm + Cn<5- m,m) 

< ^(Qm > Qm* + CnaM,M*,M = M*) + P(M is incorrect) 

M* is true 
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< ^ P{Qm > Qm* + Cnd-M,M*) + F{M is mcoTiect) ^ 0. 

M* is true 

Let E2 = is truc,|JV/|<|M|{^'^ e M^}, then E^^ = is truell^^l < |M|, M ^ 

A^_}, hence P(£'2) — > 0. This proves the "in" part. 

Finally, note that {Mq is optimal} Eq Ei <^ E2., where Eq = {M is 
true}. 

7.3. Proof of Theorem 2. First note that like the fence procedure, the 
F-B fence is guaranteed to stop at some point. This is because, otherwise, 
one keeps adding the parameters until one gets the full model, which auto- 
matically satisfies the fence inequality (note that in this case M is chosen 
as the full model). 

Next we show that w.p. 1, Afg is a true model. Suppose that this is 
not the case. Then there is an incorrect model, say, M, such that 

(10) P(il4 = ^'I) > ^, 

where 5 > is a constant. Since M is a true model, we have by the proof of 
Theorem 1 that w.p. — > 1, Qm > Qm + CuCf^j j^^- On the other hand, Mq = M 
implies that Qm + Cnd'ji.f ^ [because Mq has to satisfy (3)]. Thus, we 

have P(M(J = M) < P{Qm <Qm + Cn^M,M) ^ 0' which contradicts (10). 

We next show that w.p. — > 1, no proper submodel of Mq is a true model. 
Suppose that this is not true. Then there is a true model Mi and a con- 
stant 6>0 such that P(A/i C Mq^) > 6. Hereafter, the notation Mi C M2 
(Ml C M2) means that Mi is a (proper) submodel of M2. Suppose that un- 
der Mq, XP + Za = J2reRo -^rPr + J2seSo ^sds, and, under Mi, the same 
expression holds with Rq, Sq replaced by Ri, Si, respectively. Define Riq = 
i?i U{ri,. . . ,ra_i}, Sio = So, if Ri C Rq, Si C Sq and Ro\Ri = {ri, . ..,ra}; 
Rio = Rq, Sio = SiU {si, . . . , Sb_i}, if Ri = Rq, Si C 5*0 and So\ Si = 
{si, . . . , Sfe}; and Rio = Ri, Sio = Si otherwise. Let Mio be the model corre- 
sponding to Rio and 5*10. Then Mi C Mq implies that Mio C Mq with one 
less parameter, hence we must have Q Mio > Om + '^nO'Mio M the definition 
of M(|. It follows that 

(11) P(<3mio > Qm + Cn<7A,/jo,M) > S. 

On the other hand, we have by the proof of Theorem 1 that for any true 
model M, w.p. — > 1, Qm ^ Qm + ^n(^M m- Since Mio is always a true 
model, it follows that P(Qmio > Qm + ^^Mio,m) - true P(Qm >Qm + 
Cnf^M m) ~^ ■^hich contradicts (11). 
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7.4. Proof of Theorem 3. (i) For any e, r] > 0, let 5, Xnj, j = 1,2,3, N 
and N* be as in Assumptions A7 and A8. Then when n> N and n* > N* , 
the following arguments hold with probability > 1 — ??• 

For j = 1, 3, we have P*(x„j) > P{xn,j) - 5 > 1 - 5(5 > 1/2. It follows that 
p*{Xn,j) = maxMeMP*{Mo{xn,j) = M) = P*{Xn,j) < P{Xn,j) +5<l-25. 
Similarly, p*{xn,2) = P*{xn,2) > P{xn,2) - 6 > I - 25. Thus, there is c* e 
{xn,i,Xn,'i) which is the maximum of p* over x^^s]. Furthermore, we 

have > p*{xn.2) >l-25. 

(ii) If Mopt = Mf , then qn ~ a„/6„, hence = (6„/a„)0(l) = o(l). Also, 
by Assumption A8, for any ?? > 0, there is L > such that P(gn/9n > 
L) < rj. Choose A^i > 1 such that < 1/L when n> Ni. Then, when 
n > Ni, we have, w. p. > I - rj, {q*)~^ = qn^{qn/q*n) < 1, hence q* > 1, 
hence c* = 0. On the other hand, by Assumption A7, there is A^2 ^ 1 such 
that F{mmMeM,Al^M{ Qm > Qhlf) > 1 - r], if n> N2. Let N = NiV N2, 
then P(Mo* = Mf) > 1 - 2?/, if n > TV. 

If Mopt = -^*, then by similar arguments, it can be shown that r* = op(l) 
and g'* = op(l). Thus, for any r] > 0, there is > 1 such that when n> N 
we have, w.p. > 1 — r/, < 1 and r* < 1, hence c* = B* , hence Mq = M*. 

If Mopt ^ {Mf,M4, note that {Mq* = Mopt} ^ {^opt < c;,r„< > <} D 
Xn,ij^w< > ^^^js}, if c*j G (s^n,! , 3^71,3) ■ Therefore, by (i), for any £, 
?7 > 0, we have 

P(A/o* = Mopt) > P(Mo* = Mopt, 4 e {Xn,l,Xn,3)) 

> P(?^opt < 2;„,l, r„< > Xn,3, C* G (Xn^i, X„,3)) 

> i^opt(a:;n,i) - F^<ixn,3) - P« ^ (a:;n,i,a;„,3)) 
>l-2e-r?,n>Af,?i*>Af*. 

Acknowledgments. The authors are grateful to the Associate Editor and 
referees for their constructive comments that have led to the development 
of adaptive fence and other improvements. 

REFERENCES 

[1] Akaike, H. (1973). Information theory as an extension of the maximum likelihood 

principle. In Second International Symposium on Information Theory (B. N. 

Petrov and F. Csaki, eds.) 267-281. Akademiai Kiado, Budapest. MR0483125 
[2] Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. 

Automatic Control 19 716-723. MR0423716 
[3] Almasy, L. and Blangero, J. (1998). Multipoint quantitative-trait linkage analysis 

in general pedigrees. Am. J. Hum. Genet. 62 1198-1211. 
[4] Battese, G. E., Harter, R. M. and Fuller, W. A. (1988). An error-components 

model for prediction of county crop areas using survey and satellite data. J. 

Amer. Statist. Assoc. 80 28-36. 



FENCE METHODS FOR MIXED MODEL SELECTION 



25 



[5] BozDOGAN, H. (1994). Editor's general preface. In Proceedings of the First US/Japan 
Conference on the Frontiers of Statistical Modeling: An Informational Approach 
(H. Bozdogan et al., eds.) 3 ix-xii. Kluwer Academic Publishers, Dordrecht. 
MR1328885 

[6] Datta, G. S. and Lahiri, P. (2001). Discussion on "Scales of evidence for model 
selection: Fisher versus Jeffreys," by B. Efron and A. Gous. In Model Selection 
(P. Lahiri, ed.) 208-256. IMS, Beachwood, OH. MR20000754 

[7] DE Leeuw, J. (1992). Introduction to Akaike (1973) Information theory and an 
extension of the maximum likelihood principle. In Breakthroughs in Statistics 
(S. Kotz and N. L. Johnson, eds.) 1 599-609. Springer, London. 

[8] Fay, R. E. and Herriot, R. A. (1979). Estimates of income for small places: An 
apphcation of James-Stein procedure to census data. J. Amer. Statist. Assoc. 
74 269-277. MR0548019 

[9] Fabrizi, E. and Lahiri, P. (2004). A new approximation to the Bayes information 
criterion in finite population sampling. Technical report. Dept. Mathematics, 
Univ. Maryland. 

[10] Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an 
autoregression. J. Roy. Statist. Soc. Ser. B 41 190-195. MR0547244 

[11] Hartley, H. O. and Rao, J. N. K. (1967). Maximum likelihood estimation for the 
mbced analysis of variance model. Biometrika 54 93-108. MR0216684 

[12] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and 
Hall, London. MR1082147 

[13] Harville, D. a. (1977). Maximum likelihood approaches to variance components es- 
timation and related problems. J. Amer. Statist. Assoc. 72 320-340. MR0451550 

[14] Hodges, J. S. and Sargent, D. J. (2001). Counting degrees of freedom in hierarchi- 
cal and other richly-parameterised models. Biometrika 88 367-379. MR1844837 

[15] Jiang, J. and Zhang, W. (2001). Robust estimation in generalized linear mixed 
models. Biometrika 88 753-765. MR1859407 

[16] Jiang, J. and Rao, J. S. (2003). Consistent procedures for mixed linear model 
selection. Sankhyd 65 23-42. MR2016775 

[17] Jiang, J., Rao, J. S., Gu, Z. and Nguyen, T. (2006). Fence methods for 
mixed model selection. Technical report. Available at http://anson.ucdavis.edu/ 
"jiang/jpl0.r3.pdf. 

[18] Lahiri, P., ed. (2001). Model Selection. IMS, Beachwood, OH. MR2000750 

[19] Meza, J. and Lahiri, P. (2005). A note on the Cp statistic under the nested error 
regression model. Survey Methodology 31 105-109. 

[20] Miller, J. J. (1977). Asymptotic properties of mgiximum likelihood estimates in the 
mixed model of analysis of variance. Ann. Statist. 5 746-762. MR0448661 

[21] Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in mul- 
tiple regression. Ann. Statist. 12 758-765. MR0740928 

[22] Owen, A. (2007). The pigeonhole bootstrap. Ann. Appl. Statist. 1 386-411. 

[23] Pebley, a. R., Goldman, N. and Rodriguez, G. (1996). Prenatal and delivery care 
and childhood immunization in Guatamala; do family and community matter? 
Demography 33 231-247. 

[24] Rao, J. N. K. (2003). Small Area Estimation. Wiley, New York. MR1953089 

[25] Rodriguez, G. and Goldman, N. (2001). Improved estimation procedure for mul- 
tilevel models with binary responses: A case-study. J. Roy. Statist. Soc. Ser. A 
164 339-355. MR1830703 

[26] SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464. 
MR0468014 



26 



JIANG, RAO, GU AND NGUYEN 



[27] Shibata, R. (1984). Approximate efRciency of a selection procedure for the number 
of regression variables. Biometnka 71 43-49. MR0738324 

[28] Vaida, F. and Blanchard, S. (2005). Conditional Akaike information for mixed 
effects models. Biometrika 92 351-370. MR2201364 

[29] Ye, J. (1998). On measuring and correcting the effects of data mining and model 
selection. J. Amer. Statist. Assoc. 93 120-131. MR1614596 



J. Jiang 
T. Nguyen 

Department of Statistics 
University of California. Davis 
One Shields Ave 
Davis, California 95616 
USA 

E-MAIL: jiang@wald. ucdavis.edu 

tnguyenOwald. ucdavis.edu 

J. SuNiL Rao 
Department of Epidemiology 

and Biostatistics 
Case Western Reserve University 
10900 Euclid Ave 
Cleveland, Ohio 44106 
USA 

E-MAIL: suml@hal.cwru.edu 



Z. Gu 

ALZA Corporation 
1900 Charleston Road, Mll-4 
Mountain View, California 94039 
USA 

E-MAIL: zgu26@alzus.jnj.com 



